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Abstract 

Challenging aspects of large-scale turbulent edge simulations in plasma physics include robust nonlinear 
solvers and efficient preconditioned. This paper presents recent advances in the scalable solution of non- 
linear partial differential equations in BOUTH — h, with emphasis on simulations of edge localized modes in 
tokamaks. A six-field, nonlinear, reduced magnetohydrodynamics model containing the fast shear Alfven 
wave and electron and ion heat conduction along magnetic fields is solved by using Jacobian-free Newton- 
Krylov (JFNK) methods and nonlinear GMRES (NGMRES). Physics-based preconditioning based on an- 
alytic Schur factorization of a simplified Jacobian is found to result in an order of magnitude reduction in 
runtime over unpreconditioned JFNK, and NGMRES is shown to significantly reduce runtime while requir- 
ing only the nonlinear function evaluation. We describe in detail the preconditioning algorithm, and we 
discuss parallel performance of NGMRES and Newton-Krylov methods using the PETSc library. 

Keywords: physics-based preconditioning, edge localized modes, Newton-Krylov, nonlinear GMRES 



1. Introduction 

The international community is committed to the development of fusion energy in order to meet growing 
world energy needs. This is a complex, international, and costly endeavor; the next experimental magnetic- 
fusion plasma confinement device, ITER [T|, is estimated to cost in excess of $10B. The handling of the plasma 
exhaust is a challenge for the operation of ITER and for the design of the future demonstration power plant 
DEMO. To achieve its design goals, ITER must operate in a high-performance regime with steep pressure 
gradients close to the plasma edge [2J [3] . While this capability improves the plasma performance, the steep 
pressure gradients and associated bootstrap current can destabilize peeling-ballooning modes jlHB]. These 
result in eruptions from the plasma edge, known as edge localized modes (ELMs) (see, e.g., [7])- Scaling 
from existing experiments indicates that these violent events, if uncontrolled, could produce peak power 
loads of around I GW/m 2 onto material surfaces in ITER [9] and potentially more in a DEMO power 
plant. Such loads would result in unacceptable erosion of plasma facing components. Thus, researchers are 
extremely interested in understanding and controlling ELMs, and work is ongoing to use several nonlinear 
simulation codes in order to simulate ELMs, predict their size in ITER, and find ways to mitigate them. 

Nonlinear simulations of tokamak plasmas are challenging because of the wide range of length scales 
and timescales present: The characteristic ion cyclotron motion has scales around I mm and 10 -7 seconds, 
while the machine has length scales of ~10 m and transport timescales in seconds. If electron physics is 
important, then the characteristic scales become smaller by the square root of the mass ratio, a factor of 
~60. These small cyclotron scales are due to the strong magnetic field (typically 1 — 5 Teslas), which is 
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used to confine the plasma in a tokamak. This produces strong anisotropy in the system, as particles are 
confined perpendicular to the magnetic field but are essentially free to move along it, resulting in a difference 
of 10 s in heat conduction rates perpendicular and parallel to the field. The topology of the magnetic field 
poses challenges for simulation, as the core fieldlines form helices with a pitch angle that varies across the 
device. At the boundary of the core plasma, in the region we are interested in simulating, a separatrix 
is formed, outside of which the fieldlines intersect material boundaries rather than forming closed toroidal 
"flux" surfaces. Furthermore, the edge region of tokamaks is complicated by the interaction with neutral 
gas and material surfaces, bringing in atomic physics, sputtering, and deposition processes. 

Analytic reduction of the full problem, employing asymptotic expansions in small parameters such as 
the ratio of the Larmor radius to machine size, has been successfully used for many years (see, e.g., |10j and 
references therein). In this paper we consider fluid equations, which evolve moments of the particle velocity 
distribution, and remove the smallest scales associated with cyclotron motion Further simplifications 
include reduced magnetohydrodynamics (MHD), exploiting the anisotropy of the system in which length 
scales parallel to the magnetic field are much larger than those perpendicular to the field (see, e.g., |T2|, fT3]). 
By assuming force balance, this approach allows us to analytically remove the fast magnetosonic wave, 
which would be the fastest wave in our system. This wave is not important for the relatively slow MHD 
instabilities we wish to study here, but it is responsible for establishing force balance along the magnetic 
field. After this reduction, the system of equations still contains two fast timescales that cannot be removed 
without also removing physics of interest: the shear Alfven wave and the parallel heat transport. These can 
severely limit the timestep in numerical simulations, so implicit and semi-implicit methods are widely used 
in plasma simulation [T3]. In BOUT++ edge simulations of ELMs [15--18J, such techniques help address 
challenges in the broad range of temporal scales, complicated geometry with a range of spatial scales, and 
competing physical effects of the various degrees of freedom. 

This paper focuses on efficient and scalable nonlinear solvers for BOUT++ ELM simulations. We 
construct a physics-based preconditioner that reduces the time to solve nonlinear systems with Jacobian-free 
Ncwton-Krylov (JFNK) techniques |19j . The term physics-based preconditioner has been used to describe 
various approaches to application-specific preconditioners that incorporate insight into the physics behind 
the equations being solved [2DH22] • Inspired by the approach of Chacon [57] , we use the term here to describe 
our development of a custom preconditioning algorithm based on analytic reduction and Schur factorization 
of the system Jacobian. Identifying the key physics terms allows a simplification of the resulting equations 
and enables efficient numerical solution. A similar approach has been previously attempted within the 
NIMROD code [25], where a Crank-Nicholson (CN) Jacobian-free Newton-Krylov scheme was used along 
with physics-based preconditioning. In that work numerical instability was encountered and attributed 
to the CN scheme. In this paper we exploit the field-aligned coordinate system used in BOUT+-I- [15] to 
construct a preconditioner that reduces a large 3D problem to the solution of many one-dimensional systems, 
each of which can be solved independently. This approach greatly improves convergence of Newton-Krylov 
methods in implicit timestepping schemes, without causing numerical instability. 

We also introduce the complementary approach of nonlinear GMRES [55] , which offers the advantage of 
requiring only nonlinear function evaluations from the application, rather than also needing Jacobian- vector 
products and preconditioning for Newton-based methods. We demonstrate that both of these approaches — 
physics-based preconditioned Newton-Krylov and nonlinear GMRES — significantly reduce the time for solv- 
ing the nonlinear ELM systems at each timestep compared with a Newton-Krylov approach with no pre- 
conditioning, and we demonstrate good strong scaling on up to 512 processor cores. 

The remainder of this paper explains the details of our approach. Section [5j introduces the BOUT++ 
application and its approach to modeling ELMs. Section [3] describes the algorithms and software that we use 
to solve the resulting nonlinear systems that arise at each timestep, including both Newton-Krylov methods 
with physics-based preconditioning and nonlinear GMRES. Section [4] presents experimental results, while 
Section [5] discusses conclusions and directions for future work. 
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2. BOUT++ Simulation Overview 



The numerical methods studied in this paper are quite general, but are applied here to the solution of 
reduced MHD for fusion applications within the BOUT++ framework. 

2.1. BOUT++ Capabilities 

BOUT++, a parallel MPI + OpenMP finite difference code written in C++ [T5], simulates a range of 
plasma fluid equations in curvilinear coordinate systems — in particular, reduced MHD models in coordinates 
aligned with a strong background magnetic field. Current applications of BOUTH — h include the study of 
ELMs [16-18 , plasma turbulence [33], and blob dynamics [3U]. in this paper BOUT++ is extended to 
more efficiently simulate heat conduction parallel to magnetic fields, an important process in all these areas. 
Parallel heat conduction modifies the linear structure and growth rate of plasma instabilities and plays a 
crucial role in the loss of energy to material surfaces in tokamak edge simulations of turbulence and ELMs. 

2.2. Reduced MHD Equations 

Since the linear stability threshold is well described by ideal MHD [5 , it is reasonable to start studying 
ELMs by using fluid models based on MHD. Several initial- value codes including BOUT++ [T7], M3D- 
C 1 [3T] and NIMROD [32], have now been benchmarked and shown good agreement with linear ideal 
MHD codes ELITE |B] and GATO [33J. Nonlinearly, ideal MHD theory predicts narrowing structures 
erupting outwards, leading to a finite time singularity [31]. Long before this happens, additional effects must 
become important in order to allow plasma to decouple from magnetic fields and relax to a new equilibrium. 
Nonlinear modeling of ELMs has therefore concentrated on fluid models incorporating a range of different 
nonideal effects [32 [321 [35H35] . 

Since each simulation code currently uses different fluid models, nonlinear benchmarking of codes is 
difficult. In order to enable future comparisons between codes, a set of equations close to those used in 
JOREK [39] has been implemented in BOUT++, which we will refer to in this paper as the six-field model. 
Six scalar fields are evolved: mass density p; electron and ion temperatures T e and T^, respectively; vorticity 
f/ = b -Vxv~ ^^i^; parallel velocity i>| | = b v; and parallel vector potential A\\ =b -A. From these 
auxiliary quantities are calculated the total plasma pressure P — p(T e + 7$) and parallel current density 
J|| = bo • J = — Vj_A||. The equilibrium magnetic field unit vector is bo = Bo/-Bo- Here all quantities with 
subscript "0" are calculated from the starting equilibrium and are not evolved. The evolving quantities are 
deviations from equilibrium; for example, the total mass density is po + p. The equations for plasma density 
and temperatures are 

^ = -v • V (p + po) + (V • v) (p + po) + D^Vlp 

dT 2 

= -v • V (T s + T s0 ) - - (V • v) (T s + T s0 ) 

+ [V„ • (x.,rt,T.) + Xs,VlT s ] + j^W s , (1) 

where the total plasma velocity v = ^vekb + bun is given by an E x B drift perpendicular to magnetic 
fieldlines v £xB = g^b x and parallel flow U|| along them. The above equations describe the advection 
(v ■ V) and compression (V • v) of density and temperatures (s = e,i), collisional diffusion perpendicular 
to the magnetic field with coefficients D± and Xs±, and parallel conduction of heat along magnetic fields 
with coefficient Xs\\- Derivatives are split into those parallel to the magnetic field V|| = bo ■ V and those 
perpendicular Vj_ = V — b ■ V. 
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The velocity of the plasma is described by the vorticity and parallel flow: 



dU „ r7 1 

-v-VU 



dt p + po 

+ 2b x k • VP 4 
dv\\ „ 1 



S 2 V|| 



J ll + J I|Q 
B 



2b x k • VP + vudfJJ + is ± V 2 j_U 



= - v - Vw l|-7^T V ll P - ( 2 ) 



5* 11 P + Po 



The vorticity equation (evolving U) contains a term responsible for shear Alfven wave propagation (Bq V|| jf-), 
instability drive due to equilibrium magnetic field curvature n and pressure gradient VP, as well as parallel 
and perpendicular viscosity terms v\\ and 

The evolution of the magnetic field is described by Ohm's law for the electric field parallel to the magnetic 
field b • E = b • 77J, 

-J! = -V,|*-„J„, (3) 

where rj is the parallel resistivity. 

2.3. Timescales for Shear Alfven Wave and Parallel Heat Conduction 

The timescales of interest in equations are those of the linear growth of peeling-ballooning modes 

and nonlinear eruption of filaments, typically of the order of ^100/zs up to milliseconds. The above set of 
equations contains timescales much faster than this, which must be preconditioned in an implicit scheme. 
The vorticity and A|| equations contain the terms 

dU B 2 /J|,\ OA 



which reduce to 

d 2 cj) 



= V 



— 2 
± 



R 3 



V|| (ViV|| 
Po 



B, 



^Vf,c 
Po 11 



corresponding to the shear Alfven wave traveling along magnetic fieldlines. Given the strong magnetic field 
B and low mass density p$ in tokamaks, wave speeds of w J 4~10 7 m/s are typical, which for parallel grid 
spacing of ~10 cm —1 m give CFL timesteps ^10 _8 s, 3 to 4 orders of magnitude shorter than timescales of 
interest. 

The second fast timescale is introduced through the ion and electron temperature equation 

dT 1 

which is a diffusion equation along magnetic fieldlines. At the high temperatures relevant to fusion, electron 
thermal speeds along fieldlines of 10 6 — 10 7 m/s are typical, and hence parallel thermal diffusion is also fast 
compared with nonlinear evolution timescales. 

The above equations contain one other wave along magnetic fieldlines: the sound wave described by 

dv\\ 1 dp dT s 2 

The speed of this wave is typically around 10 5 m/s for the problems in which we are interested, several orders 
of magnitude slower than the Alfven wave and parallel electron heat conduction. It is trivial to extend the 
preconditioner described below to include this wave, but this extension was found to make little difference 
to the convergence rate and so is not included here. 

In the following sections we outline the implicit timestepping method and construct a preconditioning 
scheme that can be applied to address both the Alfven wave and temperature conduction timescales. 
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3. Scalable Nonlinear Solvers 



The coupled set of reduced MHD equations ([l}-([3]) are discretized by the method of lines, using 4th- 
order central differencing for diffusive terms, and 3rd order WENO [ID] for upwinding. Because this work 
focuses on nonlinear solvers, we chose the timestepping algorithm to be the same for each test case, namely, 
a variable order backward differential formulation (BDF) (see, e.g., [41]). This family comprises methods 
that are implicit and linear multistep, and that allow variable timestepping. At each timestep, by writing 
the system state as a vector, we obtain from equations ([l])-(|3]) a nonlinear system of the form 



/(«) = / 



/ P \ 

Ti 

T e 
U 

v \\ 

V A \\ J 



(4) 



where / : E" -> 1 
NGMRES methods. 



This requires the solution of a nonlinear system, here using Newton-Krylov and 



3.1. Newton-Krylov Methods 

In a Newton-Krylov method, we solve the nonlinear system Q at each timestep using a Newton iteration 
(see, e.g., 02]), 

U k + l = U k - [/'(Ufe)] _1 /("/c): k = 0, 1, 

Here uq is an initial approximation to the solution, and f'(uk), the Jacobian, is nonsingular at each iteration. 
A Krylov iterative method (here GMRES) is then used to solve this linear system at each Newton iteration. 
In practice, the Newton iteration is implemented by the following two steps: 

1. (Approximately) solve f'(u k )Au k = -f(u k ). 

2. Update u k +i = u k + aAu k - 

Here < a < 1 is a scalar. In these experiments, we use a JFNK variant, where Jacobian- vector products 
are computed matrix-free |19| by an approximation 

,// \ _ f(v + h- a) - f(v) 
/(«)a« ~ h , 

for a differencing parameter h. By computing finite-difference Jacobian-vector products for the iterative 
solve, assembling the actual Jacobian is never required. 

If the nonlinear system being evolved contains a wide range of timescales, as in most plasma simulations, 
then it is said to be "stiff," and the Jacobian can be ill-conditioned. In this case the Krylov method will 
require an unacceptably large number of iterations. Preconditioning the linearized Newton systems, which 
have the form JTac = 6, where J is the nonsingular n x n Jacobian matrix and x and b are n-dimensional 
vectors, transforms these to equivalent systems PJx = Pb through the action of a preconditioner, P, whose 
inverse action approximates that of the Jacobian but at smaller cost. Choosing an effective approximation 
is one of the goals of this work, since the preconditioning phase is crucial to achieving low computational 
cost and scalable parallelism. 



3.2. Physics-Based Preconditioning 

Physics-based preconditioning [21] [27l [43] is a method by which insight into the physics behind the 
equations being solved is used to precondition the Newton-Krylov method. After first identifying the stiff 
terms in the equations, an approximate inverse is constructed with block Gaussian elimination (Schur 
factorization) of the simplified analytic Jacobian. The resulting matrices have a parabolic form, even if 
the starting system is hyperbolic, as a result of the implicit timestepping procedure. Multigrid methods are 



5 



often used to solve these systems (see, e.g., [17]); but these are difficult to implement in complex geometries 
and an existing codebase that was not designed with them in mind. Here we describe a simpler scheme that 
can be used to efficiently solve the factorized Jacobian in field-aligned coordinates without the complexity 
of a multigrid method. 

To derive a preconditioner, we begin by simplifying the nonlinear six- field system, equations Q-Q. 
retaining the linear vorticity terms driving perpendicular plasma E x B motion, the fast parallel Alfven 
wave that propagates the motion along ficldlincs, the linear terms describing perpendicular advection of 
density and temperature, and the parallel diffusion of temperature perturbations. This procedure reduces 
the equations to 



dp 
dt 

dt 

dU 

dt 

9v \\ 
dt 



1 

Po 



1 



v ll ' (Xs\\9\\T s ) 



£oV|| 



Ji 
Bo 



Po 

+- 2b x k 



VP 



dA 



II 



dt 



-V,|0. 



(5) 



We have therefore neglected all nonlinear terms (those involving two or more evolving quantities), all 
perpendicular diffusive terms, viscosity, and resistivity. The neglect of nonlinear terms is justified because 
we wish to follow the nonlinear dynamics and thus should not step over these timescales; instead, we focus 
on preconditioning the fast linear dynamics. Perpendicular diffusion and resistive effects are also slow 
relative to the parallel dynamics and so can be dropped. We stress that these simplifications made in the 
preconditioner do not affect the solution to the full six-field model, only the convergence rate toward it. 
When the assumptions made here are violated, the preconditioner becomes less effective, but the solution 
is not compromised (provided that the Krylov method still converges and does not stall). 

In an implicit timestepping scheme (Section 3.1 ) the operator to be inverted is A = I — 7J, where J is the 
system Jacobian; 7 is usually proportional to the timestep and is determined by the choice of integration 
scheme. For preconditioning, a simplified form of the Jacobian is derived from Equations ([5| and is given 
in Equation 



-bxK-V[(T e0 +T i0 ) 





PO II 





PO 



V[po 






po II 








po 



V[po 



Following the methodology of [22], we first factorize the matrix 



A" 1 = (I — 7J) 1 = 



" E U ~ 


-1 


' I E^U " 




L D 




I 





E- 1 





n 





-BoV, 







x Vp ■ vv; 

x VT i0 ■ VV 
jbx VT e0 





-v,|v: 




2 [Bo 
2 [Bo 
VVX 2 [B 

2 [Bo 



(6) 



7J using Schur factorization: 
I 



-LE" 



(7) 



Schur . 

where L is a row vector containing the vorticity drive terms due to curvature and parallel current, U is a 
column vector containing the linear E x B advection terms, and E contains the parallel heat conduction 



terms, which have the form 
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The matrix . 



V l|- 



Schur 



= D — LE U is the Schur complement; and since nonlinear terms are neglected, D is 



the identity matrix. To simplify the form of fschur, we drop terms involving only perpendicular derivatives, 
leaving only the shear Alfven wave: 



chur 



7 2 ^V||ViVnV 
Po 



-2 



In order to remove the perpendicular derivatives, the following approximation is used: 



V||Vivnv: 2 



II 



2V|| (RBe) 

RBg 



where R and Bg are the equilibrium major radius and poloidal magnetic field, respectively. 

The above factorization contains three applications of E _1 ; however, it was found that only the second 
had a significant impact on the convergence rate, and so the first and last were omitted, significantly speeding 
the calculation. The resulting preconditioning operator can be divided into three stages, P = P3P2P1, and 
written as an operation on a unpreconditioned state vector v^ u \ which produces a state vector v^ p ' = Fv^ u '. 
In the following description, superscripts on state vectors or components of state vectors will indicate the 
stage of the preconditioning operation. 

Note that in the simplest backwards Euler implicit scheme when P = A -1 , the vector v<M is the solution 
at the last timestep, and the solution at the next. The preconditioner can therefore be interpreted as a 
predictor step, calculating an approximation of the state vector at the next timestep from the previous one. 
The application of P, Equation is performed as follows. 



1. Given an unpreconditioned vector w^ 7 ' = (p^ u \T^ U \T ( 
in Equation Q: 



(U) (U) 
«|| . 



A ( pM u) 



apply the first matrix 



.(-Pi) - 



I 



where the E 1 operator has been dropped, as discussed above. When this is expanded, only the 
vorticity field is modified: 



UW = l/M + 7— V,| jV ] + 7 -2b x K ■ VPM, 



Po 11 Po 

( u > (Tj + T e0 ). This operation represents a 



where J™ = -V 2 ± A\p and = p (t^ +Ti 'j+p 
forward Euler step to predict the vorticity after a timestep 7. 
2. Apply the second matrix, solving for parallel temperature conduction and shear Alfven wave dynamics, 



,(P2) 



E" 





Schur 



which can be written as 



T (P2) 

s 

UiP) 



Po 

, 2 



T (U) 



2 1 2 V ll (RB e ) 

RBgBf. 



R2 

Po 11 



Since all these operators are of the form ya + bV^j with different scalars a and b, the same parallel 
solver can be used. Since the equations for temperature and vorticity are decoupled, a further opti- 
mization, not used in this current work, would be to also solve the three equations at the same time. 
This operation represents a backward Euler step for the propagation of vorticity and temperature along 
magnetic fieldlines over a timestep 7 and is the most computationally demanding. Efficiently solving 
these systems is critical to the performance of the preconditioner, so the method used is described in 
detail below. 

Apply the third matrix, updating the other fields based on the new vorticity 



-U 



,(P2) 



7 



where again the E 1 operator has been dropped. This can be written as 

p (P) _ p (u) _ 7V £) s . VpQ 

TP = - 7vg B ■ VT s0 

where v^xb = ib^o x i s the perpendicular E x B velocity calculated by using 0^ p ' = 

Vj 2 (B U( p ^, the electrostatic potential calculated from the updated vorticity. This represents an 
implicit update of p,T s ,Au, advecting plasma quantities based on an approximate E x B velocity at 
the next timestep. 

The only variable that is not modified in any of these operations is the parallel velocity, so d^' = Un . 

The preconditioner described above contains the key physics responsible for fast timescales in the system, 
and it has greatly reduced the complexity and computational cost of calculating P. These features can be 
seen by considering the sizes of the matrices involved. Since 6 variables are evolved on N mesh points, the 
matrix A to be inverted has (6iV) 2 elements, most of which are however zero. A typical simulation will use 
on the order of 100 points in each dimension (between 64 and 512 in most cases, though the radial mesh size 
Nif, can be several thousand for ITER simulations) and so have ~10 6 mesh points, requiring the inversion 
of a (6 x 10 6 ) 2 matrix at each timestep. By block factorizing and keeping only the fast components, we 
have reduced this to solving three separate matrices (in E and Pschur), which can be performed in parallel, 
each of which is of size (^V) 2 ~ (10 6 ) 2 . This approach in itself does not represent a large savings; but 
by exploiting the structure of these blocks in the field-aligned coordinate system employed by BOUT++, 
further simplifications can be made. 

Since each block contains only derivatives along magnetic fieldlines after we remove perpendicular deriva- 
tives, and equilibrium magnetic fieldlines form helices around a torus covering 2D "flux" surfaces, we can 
independently solve for each flux surface, immediately reducing the (10 6 ) 2 inversion problem to ~100 sep- 
arate (10 4 ) 2 problems. This assumption of equilibrium flux surfaces will break down if the magnetic field 
becomes significantly perturbed in the later stages of an ELM crash, but this is not a critical problem 
because at these times nonlinear dynamics are fast and a smaller timestep is unavoidable. 

To further simplify the problem, we exploit the toroidal symmetry of the equilibrium and decompose into 
toroidal Fourier harmonics. This approach decouples the toroidal and poloidal directions, splitting the 2D 
problem into a set of one-dimensional equations in poloidal angle, with a complex phase shift representing 
the pitch of the fieldlines. If we started with flux surfaces, each containing Ng poloidal points and 
toroidal points (so N — x Ng x N^), then we are left with x (iV^/2) ~ 10 4 complex cyclic tridiagonal 
systems, each of length Ng ~ 10 2 . These are then solved efficiently in parallel by using a variant of the 
cyclic reduction algorithm with interface equations [44]. Since the three equations being solved are also 
independent, all 3 x N$ x (JVf/2) tridiagonal systems can be solved simultaneously, resulting in a good 
overlap of communication and computation on a large number of processors. 

3.3. Nonlinear GMRES 

As an alternative to preconditioned Newton-Krylov methods, we consider nonlinear GMRES (NGMRES) 
to solve the nonlinear system Q. NGMRES and other similar methods (quasi-Newton, Anderson mixing) 
of solution for nonlinear systems are analogous to multistep Krylov methods used for linear systems [281 145] . 
At each iteration, an updated solution is constructed as a linear combination of several stored previous 
solutions and a new candidate solution. A minimum-residual linear combination is found by solving a small 
minimization problem. 

The new candidate solution is constructed by perturbing the current solution by some negative damping 
of the current residual, typically found through a line search. One may also construct the candidate using 
a small number of iterations of some other nonlinear solution technique. For the present problem we use a 
damping factor of —0.001 on the residual to construct the candidate and store 50 previous solutions. By 
using these Krylov subspace accelerators, we aim to reduce the overall CPU time for each nonlinear solve 
and also reduce memory usage. 
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4. Experimental Results 

We use the nonlinear solvers in the PETSc library [151 HZ] to solve these nonlinear systems over varying 
numbers of processors. The library provides a simple interface for nonlinear solvers that enables selection 
of algorithms and parameters at runtime, with no changes to the application code. 

Our experiments were run on the Fusion cluster at Argonne National Laboratory. This cluster has 320 
compute nodes, each with an Intel Nehalem dual-socket, quad-core 2.53 GHz processor (or 2,560 compute 
processors), connected by Infiniband. 

We specified a relative nonlinear convergence tolerance of 10 -3 for all methods, which is consistent to 
what is typically employed for implicit timestepping. For each Newton-Krylov run, we used GMRES with 
a restart of 30 and a relative linear convergence tolerance of 10 -2 . The method NGMRES, by definition, 
has no linear solver associated with it. Consequently, each nonlinear solve using NGMRES has many more 
iterations, in contrast with the Newton-Krylov approaches, where an average of two or three nonlinear 
iterations achieved convergence. 



4-1. Convergence of Nonlinear Solvers 

To test the feasibility of our approach to physics-based preconditioning, we conducted initial experiments 
with a simpler problem: a two-field subset of the six-field model that contains only the Alfven wave, 



-V\\<f>-vJ\\ 



d_A^ 
dt 



with auxiliary equations 



U=l-V 2 ,6 Jn=-ViA 



D 



Though greatly simplified, this model can describe the essential physics of magnetic reconnection and island 
formation in tokamak plasmas 48 . Thus, the efficient solution of this system of equations is of interest 
in itself. We used a rectangular (slab) grid of dimension 68 x 32 x 16. As seen in Figure [l] physics-based 
preconditioned JFNK showed a significant reduction in overall runtime in comparison with the default 
(JFNK with no preconditioning). NGMRES also significantly decreased overall runtime. 

For the remaining experiments, we used the six-field model, Equations Q-Q, on a grid size of — 64, 
Ne = 64, and N$ = 128, partitioned only in the ip6- plane because of using the discrete Fourier transform 
in the £— direction. Each example has six coupled independent variables (p, T e , Ti, U, Vii, An), as given by 
Equation Q. One reason for using a relatively small grid is to test the scaling of these algorithms when the 
number of grid cells per processor becomes small. Since domain decomposition is in ip and 9, 512 processors 
(the maximum used here) will divide the mesh into 4 x 2 x 128 blocks. This is a more useful test than simply 
using a large mesh to demonstrate scaling. 

As shown in Figures [2a] and |2b[ physics-based preconditioning achieves an order of magnitude decrease 
in overall time for the Newton-Krylov nonlinear solve and nearly two orders of magnitude reduction in the 
number of Krylov iterations in comparison with the unpreconditioned variant. NGMRES also significantly 
reduces overall time for the nonlinear solve compared with the BOUT+-)- default of JFNK with no precon- 
ditioning. Note that while the nonlinear relative convergence tolerance for each run is 10 -3 , the final value 
of the function norm for the Newton-Krylov runs is much smaller compared with the NGMRES run because 
of the substantial decrease in function norm at each Newton step. While the physics-based preconditioner 
code is hand-written for each type of physics model, NGMRES requires no application-specific code and can 
be activated at runtime, making it an attractive alternative for efficient and scalable nonlinear solves. 



At early simulation times (t = 0, Figure 2a), nonlinear terms in the evolution equations are negligible, 



and so the solution is essentially exponential growth. At later times (t — A0/loa, Figure 2b), nonlinear terms 
become important, filaments begin to erupt outwards from the plasma, and the solution becomes more 
complicated, even turbulent. As a result the number of iterations and wall clock time increase significantly 
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Figure 1: Function norm (l 2 ) versus wall-clock time (sec) on 16 cores for two-field test case, Equation [8j 
Labels on the figure indicate cumulative number of linear solver iterations for Newton-Krylov methods. 
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Figure 2: Function norm (Z 2 ) versus wall-clock time (sec) on 512 cores for the full six-field model (Section 2.2 1 
at fixed problem size. Left: Results for the initial timestep t = 0. Right: Results for timestep t — 40. Labels 
indicating the cumulative number of linear solver iterations for Newton-Krylov methods show that the 
linearized Newton system is relatively easy to solve initially, though it becomes more difficult at timestep 
t = 40. 



for later timestcps. The physics-based preconditioner is marginally less effective at later times, reducing 
the iteration count over unpreconditioned JFNK by a factor of 67 at t — 40, compared with a factor 
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of 86 at t = 0. This result is expected because nonlinear terms were not included in construction of 
the preconditioner. Nevertheless, the linear preconditioner remains highly effective, indicating that linear 
dynamics are still limiting the timestep even when the plasma evolution is nonlinear. 



4-2. Scalability of Nonlinear Solvers 
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Figure 3: The upper left plot compares the scalability of each algorithm for the full six- field model. The 
upper right and lower left plots are close-ups of NGMRES and physics-based preconditioning, respectively. 
The lower right plot compares each algorithm with ideal scaling. We can see that NGMRES and especially 
physics-based preconditioning significantly reduced overall runtime. 

Figure [3] shows the scaling of the JOREK test case with the Jacobian-free Newton-Krylov (no pre- 
conditioning), nonlinear GMRES, and Jacobian-free Newton-Krylov (with physics-based preconditioning) 
algorithms in the first subfigure with lighter bars denoting the comparison with NGMRES and physics-based 
preconditioning. Notice that the physics-based preconditioning is barely legible in the first subfigure be- 
cause of the overall runtime being an order of magnitude faster than the default solver. For this reason the 
individual plots of NGMRES and physics-based preconditioning are also displayed as individual subfigures. 
All three algorithms then are plotted in a log- log graph in conjunction with ideal scaling (the dashed lines). 
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Figure 4: Results showing that function evaluation (RHS) dominates the time for the six-field model 

One can see that all three algorithms have almost identical scaling. This relatively small test case becomes 
too small for core counts larger than 128, as evidenced by the tapering of scalability. 

Clearly, using a physics-based preconditioner achieves the fastest runtime. Physics-based approaches, 
however, require detailed knowledge of the modeling equations being used [HJ [571 03] , en f° rcm g a tight 
coupling between the programmer and physics-based preconditioning analysis. Even if such a coupling 
exists, potential challenges arise with implementation because of complexities of physics and preconditioning 
algorithms. In contrast, NGMRES requires no application-specific customization, since the application 
provides code only for the nonlinear function evaluation. 

The bar graphs in Figure[4]show the proportion of time spent in various phases of the nonlinear solve. For 
each run, the evaluation of the function (or right-hand side, RHS) dominates the runtime. For Jacobian-free 
variants, a function evaluation is required for each Jacobian- vector product (one per Krylov iteration). The 
preconditioned variant of JFNK requires fewer function evaluations because of better-conditioned linear sys- 
tems. NGMRES, alternatively, needs roughly ten times more function evaluations than does preconditioned 
JFNK. Also, we can see that the time needed for computing the search direction in NGMRES (VecNorm) 
is nontrivial. This result further indicates that the parallel scaling shown in Figure [3] is due primarily to the 
scaling of the nonlinear function evaluation; improving this scaling is the subject of ongoing work. 

5. Conclusions 

We have introduced a physics-based preconditioner that leverages insight into the physics of the equations 
being solved in order to improve convergence of matrix-free Newton-Krylov methods in BOUTH — h ELM 
simulations. We also introduced the use of a complementary approach, nonlinear GMRES, which requires 
no Jacobian-vector products or preconditioning. We demonstrated that both approaches reduce time for 
solving the nonlinear systems that arise at each timestep of an implicit ELM simulation. 

The physics-based preconditioning scheme described in this paper is straightforward to construct once the 
analytic Jacobian is known, can be efficiently solved in magnetic field-aligned coordinate systems commonly 
used in plasma simulations, and is highly effective for both hyperbolic (Alfven wave) and parabolic (heat 
conduction) problems. The preconditioning steps have intuitive interpretations, making it straightforward 
to experiment with adding and removing terms in the equations in order to optimize the preconditioner. 
By applying this method to a six-field reduced MHD model, order-of-magnitude reductions in wall clock 
time have been achieved, even in regimes where nonlinear terms in the evolution equations are important. 
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The application of this improved code to ELM simulations, and predictions for ITER, will be reported in a 
separate paper. 

Using a relatively small problem size, we have demonstrated good scaling up to 512 processors. Since 
the costly matrix inversion step in the preconditioner is trivially parallelized in two dimensions (i/j and £), 
increasing the problem size should allow scaling to a much larger number of processors. Testing this will be 
the subject of future work. 

Other directions of future work include the development of additional physics-based preconditioning 
variants, such as fieldsplits for fast Alfven waves and thermal conductivity along the ficldlines. We are 
also exploring broader issues in time-dependent simulations, including the use of implicit-explicit (IMEX) 
approaches for flexible timestepping. Work is under way to incorporate BOUT++ into the FACETS frame- 
work [321 [SOj in order to facilitate research on core-edge-wall coupling. 
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